Proteomics, Post-translational Modifications, andIntegrative Analyses Reveal MolecularHeterogeneity within Medulloblastoma Subgroups

PROTEOMICS DATASET RE-ANALYSIS

Abstract

There is a pressing need to identify therapeutic targets in tumors with low mutation rates such as the malignant pediatric brain tumor medulloblastoma. To address this challenge, we quantitatively profiled global proteomes and phospho-proteomes of 45 medulloblastoma samples. Integrated analyses revealed that tumors with similar RNA expression vary extensively at the post-transcriptional and post-translational levels. We identified distinct pathways associated with two subsets of SHH tumors, and found post-translational modifications of MYC that are associated with poor outcomes in Group 3 tumors. We found kinases associated with subtypes and showed that inhibiting PRKDC sensitizes MYC-driven cells to radiation. Our study shows that proteomics enables a more comprehensive, functional readout, providing a foundation for future therapeutic strategies.

|experimental\_design| Figure 1: Summary of data types included in this study, depth of proteomic data types, and cohort composition.

Consensus clustering, principal component analysis (PCA), and t-distributed stochastic neighbor embedding (t-SNE) carried out separately on each data type predominantly revealed the known subgroups: Group 3, Group 4, and SHH (Figures ​(Figures2A,2A, S1-S3). The proteomic data also identified very stable subsets of two known subgroups. Here, we refer to Group 3 clusters as Group 3a (G3a) and Group 3b (G3b), and SHH clusters as SHHa and SHHb.

subgroups Figure 2: Comparison of clustering results.

REF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6372116/

Objectives

Use CKG’s analytics core to:

  1. Re-analyze the proteomics dataset:

    • Preprocessing

    • Stratification looking at covariates: Age, Gender, Metastatic status and histology

    • Differential regulation including covariates

    • Functional enrichment

  2. Identify patients subgroups

  3. Investigate correlating PTMs on relevant protein complexes

Loading the Data

In this analysis we use:

  • Clinical metadata

  • RNA-sequencing

  • Proteomics

  • Phosphoproteomics

  • Phosphotyrosine proteomics

  • Acetylomics

This datasets have been previously formated (samples x proteins) and standardized to UniProt identifiers (data available in /assets) (see notebook Meduloblastoma Data Processing.ipynb).

[2]:
import pandas as pd
import numpy as np

from ckg.analytics_core.analytics import analytics
from ckg.analytics_core.viz import viz

from ckg.graphdb_builder import mapping
from ckg.graphdb_connector import connector

from plotly.offline import init_notebook_mode, iplot
%matplotlib inline
init_notebook_mode(connected=True)
C:\Users\sande\.conda\envs\ckgenv\lib\site-packages\outdated\utils.py:18: OutdatedPackageWarning: The package outdated is out of date. Your version is 0.2.0, the latest is 0.2.1.
Set the environment variable OUTDATED_IGNORE=1 to disable these warnings.
  **kwargs
C:\Users\sande\.conda\envs\ckgenv\lib\site-packages\outdated\utils.py:18: OutdatedPackageWarning: The package pingouin is out of date. Your version is 0.3.10, the latest is 0.3.11.
Set the environment variable OUTDATED_IGNORE=1 to disable these warnings.
  **kwargs
WGCNA functions will not work. Module Rpy2 not installed.
R functions will not work. Module Rpy2 not installed.
[3]:
clinical_data = pd.read_excel('../../assets/samples.xlsx', header=0)
rnaseq_data = pd.read_csv("../../assets/rnaseq.tsv", sep='\t', header=0)
proteomics_data = pd.read_csv("../../assets/proteomics.tsv", sep='\t', header=0)
phosphoSTY_data = pd.read_csv("../../assets/phosphoSTY.tsv", sep='\t', header=0)
phosphoY_data = pd.read_csv("../../assets/phosphoY.tsv", sep='\t', header=0)
acetylomics_data = pd.read_csv("../../assets/acetylomics.tsv", sep='\t', header=0)

1. Re-analyze the proteomics dataset

[4]:
proteomics_data.head()
[4]:
index Q6MZP0 Q8NF91 Q8WXH0 Q15149 Q15149.1 Q15149.2 Q15149.3 P58107 B4DTV0 ... A0A0A0MR82 Q9HD23 P78325 O43763 P28749 A0A024QZT7 Q96MC2 W5XKT8 Q5VZ18 group
0 MB018 1.068076 -0.696709 0.931591 2.420232 2.420232 2.436102 2.429754 1.690194 0.122202 ... NaN 0.620531 NaN -2.797946 NaN NaN NaN -0.290428 1.607669 GR3
1 MB095 -2.623296 1.433604 -2.755208 -0.515201 -0.522668 -0.515201 -0.522668 -0.398223 -0.983114 ... NaN NaN -2.658141 NaN NaN NaN -1.707383 NaN NaN GR3
2 MB106 -0.923848 -1.281532 -1.003560 -1.279489 -1.304015 -1.304015 -1.301972 0.318850 -2.362762 ... NaN -0.443529 NaN 0.339289 NaN NaN NaN -0.504846 -0.686754 GR3
3 MB170 -2.355968 1.185584 -2.055772 0.250797 0.252697 0.241297 0.246997 0.294496 0.203297 ... NaN NaN -4.029845 NaN NaN NaN -0.075999 NaN NaN GR3
4 MB226 0.077085 -3.470951 -0.535310 -1.351123 -1.353264 -1.359688 -1.359688 -1.571671 -2.201195 ... NaN -0.668067 NaN 0.246243 1.181965 NaN NaN NaN NaN GR3

5 rows × 13127 columns

Preprocessing:

  1. Remove WNT samples (low number of samples)

  2. Filter proteins with more than 80% missing values in all groups

  3. Normalize data using median-centered normalization

  4. Impute missing values using a mixed model: K-Nearest Neighbors algorithm (KNN) if at least 60% valid values, otherwise Minimum Probability (MinProb) drawing random values from a down-shifted Gaussian distribution

  5. Include Protein name in columns using CKG’s mapping functionality

  6. Include relevant clinical features: Age, Gender, Metastatic status, histology

[5]:
proteomics_data = proteomics_data[proteomics_data['group'] != 'WNT']
[6]:
valid_proteins = analytics.extract_percentage_missing(proteomics_data, missing_max=0.2, drop_cols=['index'], group='group', how='all')
proteomics_data = proteomics_data[valid_proteins + ['index', 'group']]
[7]:
proteomics_data = analytics.normalize_data(proteomics_data, method='median', normalize='samples')

[8]:
proteomics_data = analytics.imputation_mixed_norm_KNN(proteomics_data, index_cols=['index','group']).reset_index()
proteomics_data = proteomics_data.rename(columns={'index':'sample'})
[9]:
m = mapping.getMappingFromDatabase(proteomics_data.columns, node="Protein", attribute_from='id', attribute_to='name')
columns = [m[c]+"~"+c if c in m and m[c] is not None else c for c in proteomics_data.columns]
proteomics_data.columns = columns
[10]:
proteomics_data.head()
[10]:
sample group FLJ10292~A0A023T6R1 RBM8~A0A023T787 A0A024QYR3 TM9SF2~A0A024QYR8 A0A024QYS2 AP1S1~A0A024QYT6 EIF3S8~A0A024QYU9 RABL5~A0A024QYV3 ... PMF1-BGLAP~U3KQ54 CSNK1G1~U3KQB3 LSM4~U3KQK1 HEL-76~V9HW21 MPND~W4VSR2 YP_003024026 YP_003024028 YP_003024029 YP_003024030 YP_003024036
0 MB018 GR3 0.305789 0.121693 -0.306807 -1.420907 -1.992240 0.080430 -0.586125 -0.170322 ... 1.010434 -0.398855 0.362923 2.222930 0.454971 0.169304 -3.687196 -1.779577 0.673982 1.286578
1 MB095 GR3 0.161486 0.004686 -0.117270 0.385487 0.106731 -0.577716 0.938022 0.397931 ... 0.440243 1.082378 0.149042 -0.338782 1.171978 1.405934 0.320776 1.109756 -0.313893 1.475623
2 MB106 GR3 0.246131 0.722363 0.252263 0.248175 0.037652 1.270131 0.865436 0.188902 ... 1.186331 -0.450842 -0.465150 -2.932151 -0.575521 0.571113 -0.064543 0.368766 -1.517764 1.302834
3 MB170 GR3 0.158128 0.141028 -1.245953 -1.109155 -1.055955 -1.152854 0.448824 -0.016670 ... 0.942817 -0.839358 0.255027 1.818705 0.072629 -0.073669 -0.537262 -1.215553 0.564722 -0.225667
4 MB226 GR3 0.829814 0.617831 -0.223676 -0.799670 -0.551286 -0.803952 0.647809 -0.435659 ... 1.281616 0.088945 -0.294337 0.016143 -0.225818 -0.465636 -0.119610 -1.510562 0.589995 -0.559851

5 rows × 11137 columns

[11]:
    proteomics_data = proteomics_data.set_index(['sample', 'group']).join(clinical_data.set_index(['sample', 'group'])).reset_index()
    proteomics_data = proteomics_data.drop('Proteome-based ', axis=1)
    proteomics_data.loc[proteomics_data['Metastatic  Status'].isna(), 'Metastatic  Status'] = 'M3'
[12]:
proteomics_data.head()
[12]:
sample group FLJ10292~A0A023T6R1 RBM8~A0A023T787 A0A024QYR3 TM9SF2~A0A024QYR8 A0A024QYS2 AP1S1~A0A024QYT6 EIF3S8~A0A024QYU9 RABL5~A0A024QYV3 ... MPND~W4VSR2 YP_003024026 YP_003024028 YP_003024029 YP_003024030 YP_003024036 Age At Diagnosis (years) Gender Metastatic Status Histology
0 MB018 GR3 0.305789 0.121693 -0.306807 -1.420907 -1.992240 0.080430 -0.586125 -0.170322 ... 0.454971 0.169304 -3.687196 -1.779577 0.673982 1.286578 13.0 F M0 LCA
1 MB095 GR3 0.161486 0.004686 -0.117270 0.385487 0.106731 -0.577716 0.938022 0.397931 ... 1.171978 1.405934 0.320776 1.109756 -0.313893 1.475623 3.0 M M0 classic
2 MB106 GR3 0.246131 0.722363 0.252263 0.248175 0.037652 1.270131 0.865436 0.188902 ... -0.575521 0.571113 -0.064543 0.368766 -1.517764 1.302834 3.0 M M3 LCA
3 MB170 GR3 0.158128 0.141028 -1.245953 -1.109155 -1.055955 -1.152854 0.448824 -0.016670 ... 0.072629 -0.073669 -0.537262 -1.215553 0.564722 -0.225667 16.0 M M3 LCA
4 MB226 GR3 0.829814 0.617831 -0.223676 -0.799670 -0.551286 -0.803952 0.647809 -0.435659 ... -0.225818 -0.465636 -0.119610 -1.510562 0.589995 -0.559851 5.0 M M3 classic

5 rows × 11141 columns

Sample Stratification

We use Principal Component Analysis to visualize whether the groups cluster together.

[13]:
pca, args = analytics.run_pca(proteomics_data, drop_cols=['Gender', 'Metastatic  Status', 'Age At Diagnosis (years)', 'Histology'], group='group', annotation_cols=['sample'], components=2)


[14]:
args.update({'title':"PCA Medulloblastoma groups", 'loadings':10, 'factor': 400})
figure = viz.get_pca_plot(pca, identifier='pca_plot', args=args)
iplot(figure.figure)
viz.save_DASH_plot(figure, name="PCA_Medulloblastoma_groups", plot_format='png', directory='.')

We can also color based on possible covariates: Age, Gender, Metastatic status and histology

[15]:
pca, args = analytics.run_pca(proteomics_data, drop_cols=['Proteome-based', 'Gender', 'group', 'Metastatic  Status', 'group', 'Gender', 'Histology'], group='Age At Diagnosis (years)', annotation_cols=['sample'], components=2)


[16]:
args.update({'title':"PCA Medulloblastoma by Age", 'loadings':10, 'factor': 400})
figure = viz.get_pca_plot(pca, identifier='pca_plot', args=args)
iplot(figure.figure)
viz.save_DASH_plot(figure, name="PCA_Medulloblastoma_age", plot_format='png', directory='.')
[17]:
pca, args = analytics.run_pca(proteomics_data, drop_cols=['Proteome-based', 'Age At Diagnosis (years)', 'group', 'Metastatic  Status', 'Histology'], group='Gender', annotation_cols=['sample'], components=2)


[ ]:

[18]:
pca, args = analytics.run_pca(proteomics_data, drop_cols=['Proteome-based', 'Gender', 'group','Age At Diagnosis (years)' , 'Gender', 'Histology'], group='Metastatic  Status', annotation_cols=['sample'], components=2)


[19]:
args.update({'title':"PCA Medulloblastoma by Metastatic Status", 'loadings':10, 'factor': 400})
figure = viz.get_pca_plot(pca, identifier='pca_plot', args=args)
iplot(figure.figure)
viz.save_DASH_plot(figure, name="PCA_Medulloblastoma_metastatics", plot_format='png', directory='.')
[20]:
pca, args = analytics.run_pca(proteomics_data, drop_cols=['Proteome-based', 'Gender', 'group', 'Metastatic  Status', 'group', 'Gender', 'Age At Diagnosis (years)'], group='Histology', annotation_cols=['sample'], components=2)


[21]:
args.update({'title':"PCA Medulloblastoma by Histology", 'loadings':10, 'factor': 400})
figure = viz.get_pca_plot(pca, identifier='pca_plot', args=args)
iplot(figure.figure)
viz.save_DASH_plot(figure, name="PCA_Medulloblastoma_histology", plot_format='png', directory='.')

We can also stratify looking at biological processes (GO terms) common to the studied cohort. We use single-sample GSEA (REF:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2783335/) to project each sample onto a 2D space of normalized gene set enrichment scores.

We extract the Gene Ontology annotation from CKG’s graph database using the connector module.

[22]:
go_terms_query = "MATCH (p:Protein)-[]-(bp:Biological_process) WHERE (p.name+'~'+p.id) IN {} RETURN DISTINCT (p.name+'~'+p.id) AS identifier,bp.name AS annotation"
go_terms_query = go_terms_query.format(proteomics_data.columns.tolist())
annotation = connector.run_query(go_terms_query)
[23]:
annotation.head()
[23]:
annotation identifier
0 mitochondrial genome maintenance AKT3~Q9Y243
1 mitochondrial genome maintenance MGME1~Q9BQP7
2 mitochondrial genome maintenance SLC25A33~Q9BSK2
3 mitochondrial genome maintenance LONP1~P36776
4 mitochondrial genome maintenance OPA1~O60313
[24]:
ssgsea_data = analytics.run_ssgsea(data=proteomics_data.drop(['Metastatic  Status', 'Gender', 'Histology', 'Age At Diagnosis (years)'], axis=1), annotation=annotation, annotation_col='annotation',
                     identifier_col='identifier', set_index=['group', 'sample'], outdir=None,
                     min_size=10, scale=False, permutations=0)
[25]:
pca, args = analytics.run_pca(ssgsea_data['nes'], drop_cols=[ ], group='group', annotation_cols=['sample'], components=2)
[26]:
args.update({'title':"Functional PCA Medulloblastoma groups", 'loadings':25, 'factor': 15})
figure = viz.get_pca_plot(pca, identifier='pca_plot', args=args)
iplot(figure.figure)
viz.save_DASH_plot(figure, name="Functional_PCA_Medulloblastoma_groups", plot_format='png', directory='.')

Differential Regulation

We perform ANCOVA analysis including covariates: Age, Gender, Metastatic status and Histology.

[27]:
pro_anova_result = analytics.run_ancova(proteomics_data, covariates=['Age At Diagnosis (years)', 'Gender', 'Histology', 'Metastatic  Status'], drop_cols=['sample'], subject=None, permutations=0)
[28]:
pro_anova_result.shape
[28]:
(33405, 23)
[29]:
anova_significant = pro_anova_result[pro_anova_result.rejected & (pro_anova_result['posthoc padj'] <= 0.01)]
[30]:
anova_significant['identifier'].unique().shape
[30]:
(2590,)
[31]:
volcano_plot = viz.run_volcano(pro_anova_result, identifier='proteomics_volcanos',
            args={'alpha':0.01, 'fc':2, 'num_annotations': 80,
                  'colorscale':'Blues', 'showscale': False,
                  'marker_size':8, 'x_title':'log2FC', 'y_title':'-log10(pvalue)'})
i = 0
for plot in volcano_plot:
    iplot(plot.figure)
    viz.save_DASH_plot(plot, name="volcano_plot_"+str(i), plot_format='png', directory='.')
    i += 1

Functional Enrichment Analysis

[32]:
enrichment_results = analytics.run_up_down_regulation_enrichment(pro_anova_result, annotation, identifier='identifier', groups=['group1', 'group2'], annotation_col='annotation', reject_col='rejected', group_col='group', method='fisher', correction='fdr_bh', alpha=0.01, lfc_cutoff=1)
[ ]:

[33]:
figures = viz.get_enrichment_plots(enrichment_results, identifier='enrichment', args={'width':2200})
i = 0
for fig in figures:
    iplot(fig.figure)
    viz.save_DASH_plot(fig, name="enrichment_"+str(i), plot_format='png', directory='.')
    i += 1